Similarity Joins of Text with Incomplete Information Formats

نویسندگان

  • Shaoxu Song
  • Lei Chen
چکیده

Similarity join over text is important in text retrieval and query. Due to the incomplete formats of information representation, such as abbreviation and short word, similarity joins should address an asymmetric feature that these incomplete formats may contain only partial information of their original representation. Current approaches, including cosine similarity with q-grams, can hardly deal with the asymmetric feature of similarity between words and their incomplete formats. In order to find this type of incomplete format information with asymmetric features, we develop a new similarity join algorithm, namely IJoin. A novel matching scheme is proposed to identify the overlap between two entities with incomplete formats. Other than q-grams, we reconnect the sequence of words in a string to reserve the abbreviated information. Based on the asymmetric features of similar entities with incomplete formats, we adopt a new similarity function. Furthermore, an efficient algorithm is implemented by using the join operation in SQL, which reduces pairs of tuples in similarity comparison. The experimental evaluation demonstrates the effectiveness and the efficiency of our approach.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Text Joins for Data Cleansing and Integration in an RDBMS

An organization’s data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching can be effectively performed via the cosine similarity metric from t...

متن کامل

Assessing Reading Comprehension of Expository Text across Different Response Formats

This study investigated if different response formats (test methods) measure reading comprehension of expository text differently. The study was conducted with 48 semester 6 TESL students at a university in Selangor, Malaysia. These students received an expository passage having descriptive rhetorical structure followed by three response formats, namely, incomplete outline, graphic organizer, a...

متن کامل

SHAPLEY FUNCTION BASED INTERVAL-VALUED INTUITIONISTIC FUZZY VIKOR TECHNIQUE FOR CORRELATIVE MULTI-CRITERIA DECISION MAKING PROBLEMS

Interval-valued intuitionistic fuzzy set (IVIFS) has developed to cope with the uncertainty of imprecise human thinking. In the present communication, new entropy and similarity measures for IVIFSs based on exponential function are presented and compared with the existing measures. Numerical results reveal that the proposed information measures attain the higher association with the existing me...

متن کامل

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

متن کامل

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007